Generate a Sequence With No Duplicate Uniformly at Random

Given a continuous range - , and the length of the sequence - , how can a sequence with no duplicates be generated uniformly at random? For a sequence , we say it has no duplicate if and only if .

This is very useful for sampling. For example, you have an array , and you want to select some samples to calculate their variance. The general way is to generate an index sequence and select the element with the generated indices. For example, in Python we use random.sample to do this.

Here I will show some ways to generate such a sequence.

Method One

Let's consider the simplest algorithm. We can put the entire value range directly into an array and then shuffle it. If we take the first elements, we get a generation complexity of . However, shuffling has a high constant factor, so the whole process can be seen as , where is a constant factor. Is there a better approach?

Actually, there's an even simpler idea, but with a flaw: just randomly pick numbers to construct a sequence. The flaw here is the possibility of generating duplicate numbers. How can we fix this? A simple solution is to put the entire range of values into a treap, select elements from it, and then remove them. This approach would have a complexity of , but with a large constant factor, which is not ideal. The solution is also quite simple: put the whole range of values into a std::vector , then randomly choose an index , put the corresponding element into the target vector , swap with the last element of , and then pop it off. The total complexity remains , but due to constants associated with operations like random number generation, the entire process can be considered as . Since , this approach is consistently no worse than the previous one, and may be better in certain cases.

Analyzing correctness: Assuming the current selection is the -th number and the range of values is , then the probability of all previous selections not containing is:

The probability of selecting at the current step is . Therefore, the probability of selecting is . This is indeed a uniform distribution.

Method 2

It's the same naive idea as before: just generate without worrying about duplicates. What if we encounter duplicates? Don't worry about it! Remember how we calculate quadratic residues? We pick a number at random, and if it's a duplicate, we don't care. We can just pick another one. The only concern is that after many selections, we might run into duplicates again and again.

However, since we are picking numbers randomly, randomness is built in :) So we have to calculate the expectation.

First, let's solve a similar problem:

Perform a task. Each time there's a probability of failure and a probability of success. After a failure, we keep trying until success, then stop. Find the expected number of steps to reach success. The expected number can be calculated as follows

This is just a geometric series! It becomes

Oops! The expected number of successes is the inverse of the probability of success! (Although it's a classic result and seems quite intuitive QwQ)

We want to find the complexity of randomly choosing a number that did not appear, so it suffices to compute an upper bound. Obviously, the maximum complexity occurs when selecting the -th number. At this point, the probability of success is . Let , then the expected complexity is

So if we choose numbers, the total complexity will be . However, checking whether a number has appeared before may require using a self-balancing BST due to the large range of values, resulting in a complexity of . By omitting , a significant improvement is achieved!

Frankenstein's Monster (Patchwork Solution)

However, the above algorithm cannot satisfy all scenarios. For instance, when , each selection incurs an enormous cost, leading to a shocking expected complexity of .

Then, similar to the approach for some problems, we can choose the method based on which algorithm is faster. Obviously, if is larger, the method from the discussion below is preferred; otherwise, the method from the previous discussion is adopted. But when exactly? Just by solving the equation below, we can determine the threshold:

Which leads to:

From this, we derive that . Therefore, let's set a threshold . When , we adopt the former approach; otherwise, we go with the latter. Analysis of the complexity: when , since , the complexity is . When , , which can be regarded as a constant, so it's also . The overall complexity is .

Further Optimization

Optimization 1

Of course, there's more room for optimization. For instance, if , we can directly apply the approach discussed earlier, or use std::sort for constant factor optimizations in time and space. This results in a time and space complexity of . However, given the reasonable threshold derived earlier, this optimization may not be critical.

Optimization 2

Another optimization concerns the time complexity. Checking whether a number has appeared can be done not only with a BST but also with a hash table. This would improve the time complexity, but the constant factors can be very large. In this case, the complexity of the second method would become . Solve the equation:

Solving this, we get . So, let's set . If , use the first approach; otherwise, use the second one. Further analysis of the complexity: When , , and can be regarded as a constant. Thus, the complexity is ; when , , which can be seen as a constant, resulting in complexity. The total complexity is .

Optimization 3

Still related to checking whether a number has appeared. When is not large and there's enough memory space, we can use the common 'vis' array for the check. Using std::bitset can greatly optimize constants. When , consider using this method for constant optimizations.

Optimization 4

In many cases, we don't require such randomness. Therefore, we can divide the value range equally into blocks and solve each block separately. This keeps the time complexity unchanged while reducing the space complexity to or .

Method 3 / Optimization 5

While I was sleeping I thought of another method, method 3. I'm not sure if it's correct. So if you think it is wrong, please tell me in the comments. Specifically, we first randomly select numbers, then sort and remove duplicates (use radix sorting to accumulate). Let's say we have numbers left. Now set and solve recursively. I can't prove the complexity here, so let's just present a table:

            n/V             Exp
         0.0001               1
         0.0002               1
         0.0004               1
         0.0008               1
         0.0016               1
         0.0032               1
         0.0064               2
         0.0128               2
         0.0256               2
         0.0512               3
         0.1024               4
         0.2048               6
         0.4096              11
         0.8192            47.7

represents the average number of times required to obtain at least numbers using the method mentioned above.

It's evident that when is relatively small, the expected number of attempts is very small. Therefore, within the above method, if is large ( is very small), applying this method to generate the first or first numbers, and then using Method 2 to generate the rest, can slightly optimize the constant factors in time and space.

Attached is the code for generating the table: tester.cpp.


After I wrote the implementation of the method above, I realize that if I replace Method 2 with Method 3, will it be faster?

It seems that in Method 3, the constant becomes too large when the ratio is inappropriate. But compared to the constant in Method 2, is this constant truly large? In addition, the ratio should be less than 0.5, so the constant reaches a maximum of 20. Is this really more optimal?

Indeed it is.

Here is the table using only Method 3 (unsigned int): M3table.

The table above represents the performance of Method 3 when fully utilizing . We don't use the n/V=0.8 group, so it looks good.

Running data points is 3 to 6 times faster than the hash approach.

It appears be slower in cases where the range of value is very small compared to the code above. However, we have a std::bitset approach that strictly outperforms the hash approach for small value ranges. Combining these methods results in an optimal solution!

For Method 3, we can still optimize! We noticed that the generated numbers are ordered among themselves, only the newly generated numbers are unordered. So we just need to sort and merge the newly generated numbers. Less constant time, get it!

For Optimization 3, there's even more optimization. If we want to generate sequence many times, we can pretreat the bitset for better constants. That is, if you use std::bitset::reset, it'll be faster than declaring one.

Here's the final code: UNIG.cpp.

Then we can check whether the program is correct using the program below for , where is the lower bound, is the upper bound:

#include<iostream>
#include<cstdio>
#include<cctype>
#include<vector>
#include<cstring>
#include<algorithm>
#include<ctime>
#include<random>
#include<cctype>
#include<iomanip>
using namespace std;

#define int long long

int x[10000005];
int cnt[10000001];

signed main() {
    freopen("test.in","r",stdin);
    int n,V=1e9;
    cin>>n;
    for(int i=1;i<=n;i++) cin>>x[i];
    double avg=0;
    double var=0;
    cout.precision(10);
    for(int i=1;i<=n;i++) avg+=x[i];
    avg/=n;
    cout<<"Average number: "<<avg<<endl;
    for(int i=1;i<=n;i++) var+=(x[i]-avg)*(x[i]-avg);
    cout<<"Variance value: "<<var/(n-1)<<endl;
    return 0;
}

Result:

Average number: 499849719.6
Variance value: 8.33604178e+16

We know mathematically that the expected mean is and the variance is . That's consistent with the forecast.


In this edition, I only write the important content. There's a Chinese edition of this article. Also, it's the former one. In this edition, there are more interesting things in it. But they are not important.

Thanks to ChatGPT and DeepL. They helped me to improve the English expression of this blog.